Jean Golding Institute: PURE Data Challenge 2017

Entry by: Chris McWilliams, EngMaths

This slideshow contains my entry to the PURE Data Challenge. In it I develop a proof-of-concept for a novel approach to the study of interdisciplinarity. The approach draws inspiration from the field of network ecology.

The slideshow is rendered using reveal.js and jupyter notebook. It is organised in a series of columns. Please navigate to the bottom of a column (using down-arrow), before proceeding to the next column (using right-arrow).

The complete python code used to generate the results can be viewed in-line using the button below, but the formatting is best with code view turned off. Press F11 to view in fullscreen.

Contents:

A. The collaboration network

       Part I - Developing the network and community detection
       Part II - Studying community properties

B. The publication network

C. The mutualistic network

D. References

In [127]:
import networkx as nx
import os
import itertools
import pickle
import matplotlib.pyplot as plt
import graphlab
from graphlab import aggregate as agg
import numpy as np
from datetime import datetime
import community
from collections import OrderedDict
from IPython.display import Image, display
import sklearn.metrics as sklm
%matplotlib inline

A. The collaboration network

Part I - Developing the network and community detection.

A natural way to represent the PURE data is as a collaboration network between authors. Each node in this network represents an author, and they are connected by a link if they have co-authored a paper. Link weights are given by the number of papers co-authored by that pair of authors. We remove isolated nodes - those authors that do not collaborate with any others - since we are interested in collaborations.

We construct this network using the python package networkx. With each author plotted on the edge of a cirlce, the whole collaboration network looks like this:

In [142]:
authors = graphlab.SFrame.read_csv('170331_PURE_Data_Challenge/PURE Data Challenge/authors.csv', delimiter=',', skiprows=0, na_values=['NA','NULL'], verbose=False)
pub_authors = authors.groupby(key_columns='PUBLICATION_ID', operations={'authors':agg.CONCAT('PERSON_ID')})
#pub_authors.print_rows(num_rows=5)
In [49]:
def create_collab_net(links):
    '''Take dictionary of links and produce networkx graph.'''
    
    nodes = []
    for l in links.keys():
        if l[0] not in nodes:
            nodes.append(l[0])
        if l[1] not in nodes:
            nodes.append(l[1])
            
    G = nx.Graph()
    G.add_nodes_from(nodes)
    for l in links.keys():
        G.add_edge(*l, weight=links[l])
        
    return G
In [520]:
solo_count = 0
links = dict()
for pub in pub_authors:
    
    if len(pub['authors'])==1:
        solo_count += 1
    else:
        _links = itertools.combinations(pub['authors'],2)
        for l in _links:
            if l not in links.keys():
                links[l] = 1
            else:
                links[l] +=1
                
G = create_collab_net(links)

#print "Network created:"
print "(There are %d nodes in this network, and %d links.)" %(len(G.nodes()),len(G.edges()))
(There are 2561 nodes in this network, and 9268 links.)
In [346]:
plt.figure(figsize=(9,9))
plt.subplots_adjust(top=1.1)
nx.draw_circular(G, node_size=1, width=0.05)

tfs=15
subtitle = 'The collaboration network between authors, each node placed on outer circumference representing an author.'.replace(' ','\ ')
subtitle_l2 = 'Each link representing a collaboration between authors (weights not shown).'.replace(' ','\ ')
title = plt.title('FigA1: Collaboration network. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

Beyond an appreciation of the complexity of the collaborations ongoing at the University, there is not much to be learned from the previous network image.

We can simplify the representation by aggregating authors into groups. A natural node grouping, or partition of the network, is given by the organisation to which each author belongs.

Grouping authors into organisations, and plotting the collaborations between them, the network looks like this:

In [53]:
## We load the staff data, and assign an integer id (orgid) to each ORGANISATION_CODE that is in use.
## There is some duplication in the staff table, so we filter duplicate entries.
staff = graphlab.SFrame.read_csv('170331_PURE_Data_Challenge/PURE Data Challenge/staff.csv', delimiter=',', skiprows=0, na_values=['NA','NULL'], verbose=False)
orgs = staff['ORGANISATION_CODE'].unique()
orgdict = dict()
for ii,org in enumerate(orgs):
    orgdict[org] = ii
    
staff['orgid'] = staff['ORGANISATION_CODE'].apply(lambda org: orgdict[org])
staff = staff.groupby(key_columns='PERSON_ID', operations={
                                            'ORGANISATION_CODE':agg.SELECT_ONE('ORGANISATION_CODE'),
                                            'TYPE':agg.SELECT_ONE('TYPE'),
                                            'orgid':agg.SELECT_ONE('orgid'),
                                        })
In [54]:
def _calc_group_sizes(part):
    '''Calculates groups sizes for a given partition (dict)'''
    community_sizes = OrderedDict.fromkeys(np.unique(part.values()),0)
    for node in part.keys():
        community_sizes[part[node]] += 1
        
    return community_sizes
In [545]:
node_scaling = 1.2
partition_pure = dict(zip(staff['PERSON_ID'], staff['orgid']))
IG = community.induced_graph(graph=G, partition=partition_pure)
organisation_sizes = _calc_group_sizes(partition_pure)

plt.figure(figsize=(9,9))
pos = nx.spring_layout(IG,k=0.2)
nx.draw(IG, pos=pos, nodelist=[node for node in IG.nodes()], node_size=[node_scaling * organisation_sizes[node] for node in IG.nodes()])

tfs=15
subtitle = 'The collaboration network between university organisations.'.replace(' ','\ ')
subtitle_l2 = 'Spring layout. Node size scaled by number of authors in organisation. Link weights not shown.'.replace(' ','\ ')
title = plt.title('FigA2: Organisation induced graph. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

In the above plot, organisations are red circles - their size scaled by the number of authors they contain.

The network layout is generated using a spring algorithm, such that strongly interacting organisations are bunched together.

We notice a core of larger organisations that collaborate strongly with others, surrounded by a periphery of smaller and more insular (specialised) organisations. Three small organisations have only internal collaborations.

It is clear that there is a significant amount of research that crosses organisational boundaries.

In fact the organisational strucutre of the university is a hierarchical tree that looks like this:

In [59]:
hierarchy = graphlab.SFrame.read_csv('170331_PURE_Data_Challenge/PURE Data Challenge/org_hierarchy.csv', delimiter=',', skiprows=0, na_values=['NA','NULL'], verbose=False)
orgkey = graphlab.SFrame.read_csv('170331_PURE_Data_Challenge/PURE Data Challenge/org_key.csv', delimiter=',', skiprows=0, na_values=['NA','NULL'], verbose=False)
In [60]:
H = nx.DiGraph()

added =[]
for row in hierarchy:
    n1 = row['PARENT_ORG_CODE']
    n2 = row['CHILD_ORG_CODE']
    
    if n1 not in added:
        H.add_node(n1)
        added.append(n1)
    if n2 not in added:
        H.add_node(n2)
        added.append(n2)
        
for row in hierarchy:
    n1 = row['PARENT_ORG_CODE']
    n2 = row['CHILD_ORG_CODE']
    H.add_edge(n1,n2)
In [61]:
## Credit to: stackoverflow.com/questions/29586520/can-one-get-hierarchical-graphs-from-networkx-with-python-3
def hierarchy_pos(G, root, width=5., vert_gap = 0.2, vert_loc = 0, xcenter = 0.5, 
                  pos = None, parent = None):
    '''If there is a cycle that is reachable from root, then this will see infinite recursion.
       G: the graph
       root: the root node of current branch
       width: horizontal space allocated for this branch - avoids overlap with other branches
       vert_gap: gap between levels of hierarchy
       vert_loc: vertical location of root
       xcenter: horizontal location of root
       pos: a dict saying where all nodes go if they have been assigned
       parent: parent of this branch.'''
    if pos == None:
        pos = {root:(xcenter,vert_loc)}
    else:
        pos[root] = (xcenter, vert_loc)
    neighbors = G.neighbors(root)
    if parent != None and parent in neighbors:
        #print parent
        #print neighbors
        neighbors.remove(parent)
    if len(neighbors)!=0:
        dx = width/len(neighbors) 
        nextx = xcenter - width/2 - dx/2
        for neighbor in neighbors:
            nextx += dx
            pos = hierarchy_pos(G,neighbor, width = dx, vert_gap = vert_gap, 
                                vert_loc = vert_loc-vert_gap, xcenter=nextx, pos=pos, 
                                parent = root)
    return pos
In [549]:
pos = hierarchy_pos(H,'UNIV')    
plt.figure(figsize=(16,8))
nx.draw(H, pos=pos, with_labels=False, node_size=20)

tfs=23
subtitle = 'Hierarchical tree, with university as the root. Organisations are red circles.'.replace(' ','\ ')
subtitle_l2 = 'Arrows pointing from parent to child organisation.'.replace(' ','\ ')
title = plt.title('FigA3: University organisational structure. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

The root of the tree is the whole University, and child organisations are nested below.

The organisation membership of individuals is given in 'staff.csv', sometimes at differing levels of this hierarchy.

If we were to aggregate our collaboration network to organisations at some higher level of the tree, we would lose some of the smaller organisations and, perhaps, some of the apparent interdisciplinarity. However, for simplicity we take organisation membership at the level given in 'staff.csv'.

But why should the conventional organisational structure be the one that best represents current collaborations?

Can we find a better one?

We seek a different grouping of authors into communities that more closely resemble their research collaborations..

To do this we use community detection - identifying groups of authors that collaborate more within their group than without. We use the Louvain algorithm [1], which optimises a quantity called modularity. The groups it identifies are called communities, and the network produced by aggregating nodes into these communities is referred to as the induced graph.

The algorithm gives sequential levels of community structure - with fewer communities of increasing size at each level. The final level is the best community structure achieved by the algorithm.

For our collaboration network the algorithm produces three levels of community structure. The first level contains 407 communities, and the induced graph looks like this:

(The plotted community sizes are scaled by number of authors they contain.)

In [63]:
den = community.generate_dendrogram(G)
part0 = den[0]

part1=dict()
for key in part0.keys():
    part1[key] = den[1][part0[key]]

part2=dict()
for key in part0.keys():
    part2[key] = den[2][part1[key]]
In [352]:
IG = community.induced_graph(graph=G, partition=part0)
community_sizes = _calc_group_sizes(part0)

plt.figure(figsize=(9,9))
test_nodes = IG.nodes()
isolates  = []
for node in test_nodes:
    node_in_edges = False
    for ed in IG.edges():
        if node in ed and ed[0]!=ed[1]:
            node_in_edges = True
            break
    
    if not node_in_edges:
        isolates.append(node)

connected = [node for node in IG.nodes() if node not in isolates]        
_sub0 = IG.subgraph(nbunch=isolates)
_sub1 = IG.subgraph(nbunch=connected)

nx.draw_circular(_sub0, nodelist=_sub0.nodes(), node_size=[community_sizes[n] for n in _sub0.nodes()])
pos = nx.circular_layout(_sub1)
for p in pos.keys():
    pos[p][0] = pos[p][0]*0.75 + 0.125
    pos[p][1] = pos[p][1]*0.75 + 0.125
    
nx.draw(_sub1, pos=pos, nodelist=_sub1.nodes(), node_size=[community_sizes[n] for n in _sub1.nodes()], width=0.1)

tfs=15
subtitle = 'The collaboration network between detected communities for the level 1 partition'.replace(' ','\ ')
subtitle_l2 = 'found by Louvain algorithm. Concentric layout, isolates in outer ring.'.replace(' ','\ ')
title = plt.title('FigA4: Community induced graph (level 1). \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

The second level contains 129 communities, and the induced graph looks like this:

In [355]:
IG = community.induced_graph(graph=G, partition=part1)
community_sizes = _calc_group_sizes(part1)

plt.figure(figsize=(9,9))
test_nodes = IG.nodes()
isolates  = []
for node in test_nodes:
    node_in_edges = False
    for ed in IG.edges():
        if node in ed and ed[0]!=ed[1]:
            node_in_edges = True
            break
    
    if not node_in_edges:
        isolates.append(node)

connected = [node for node in IG.nodes() if node not in isolates]        
_sub0 = IG.subgraph(nbunch=isolates)
_sub1 = IG.subgraph(nbunch=connected)

nx.draw_circular(_sub0, nodelist=_sub0.nodes(), node_size=[community_sizes[n] for n in _sub0.nodes()])
pos = nx.circular_layout(_sub1)
for p in pos.keys():
    pos[p][0] = pos[p][0]*0.75 + 0.125
    pos[p][1] = pos[p][1]*0.75 + 0.125
    
nx.draw(_sub1, pos=pos, nodelist=_sub1.nodes(), node_size=[community_sizes[n] for n in _sub1.nodes()], width=0.1)

tfs=15
subtitle = 'The collaboration network between detected communities for the level 2 partition'.replace(' ','\ ')
subtitle_l2 = 'found by Louvain algorithm. Concentric layout, isolates in outer ring.'.replace(' ','\ ')
title = plt.title('FigA5: Community induced graph (level 2). \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

And the best partition, given by the third level, contains 110 communities. The induced graph looks like this:

In [356]:
IG = community.induced_graph(graph=G, partition=part2)
community_sizes = _calc_group_sizes(part2)

plt.figure(figsize=(9,9))
test_nodes = IG.nodes()
isolates  = []
for node in test_nodes:
    node_in_edges = False
    for ed in IG.edges():
        if node in ed and ed[0]!=ed[1]:
            node_in_edges = True
            break
    
    if not node_in_edges:
        isolates.append(node)

connected = [node for node in IG.nodes() if node not in isolates]        
_sub0 = IG.subgraph(nbunch=isolates)
_sub1 = IG.subgraph(nbunch=connected)

pos0 = nx.circular_layout(_sub0)
nx.draw(_sub0, pos=pos0, nodelist=_sub0.nodes(), node_size=[community_sizes[n] for n in _sub0.nodes()])
#nx.draw_circular(_sub0, nodelist=_sub0.nodes(), node_size=[community_sizes[n] for n in _sub0.nodes()])
pos = nx.circular_layout(_sub1)
for p in pos.keys():
    pos[p][0] = pos[p][0]*0.75 + 0.125
    pos[p][1] = pos[p][1]*0.75 + 0.125
    
nx.draw(_sub1, pos=pos, nodelist=_sub1.nodes(), node_size=[community_sizes[n] for n in _sub1.nodes()], width=0.1)

tfs=15
subtitle = 'The collaboration network between detected communities for the best community partition'.replace(' ','\ ')
subtitle_l2 = 'found by Louvain algorithm. Concentric layout, isolates in outer ring.'.replace(' ','\ ')
title = plt.title('FigA6: Community induced graph (level 3) - best partition. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

Compare this community induced graph (FigA6) to the organisation induced graph (FigA2) that we looked at initially, plotted now with the same concentric layout:

In [357]:
node_scaling = 1.2
partition_pure = dict(zip(staff['PERSON_ID'], staff['orgid']))
IG = community.induced_graph(graph=G, partition=partition_pure)
organisation_sizes = _calc_group_sizes(partition_pure)

plt.figure(figsize=(9,9))
test_nodes = IG.nodes()
isolates  = []
for node in test_nodes:
    node_in_edges = False
    for ed in IG.edges():
        if node in ed and ed[0]!=ed[1]:
            node_in_edges = True
            break
    
    if not node_in_edges:
        isolates.append(node)

_prev_sub0 = _sub0        
connected = [node for node in IG.nodes() if node not in isolates]        
_sub0 = IG.subgraph(nbunch=isolates)
_sub1 = IG.subgraph(nbunch=connected)

pos0 = pos0 ## use same circumferenc outer ring as previous plot - circular layout not working with only 3 isolates
pos0[20] = pos0[pos0.keys()[0]]
pos0[45] = pos0[pos0.keys()[28]]
pos0[47] = pos0[pos0.keys()[56]]
nx.draw(_sub0, pos=pos0, nodelist=_sub0.nodes(), node_size=[community_sizes[n] for n in _sub0.nodes()])
nx.draw_circular(_prev_sub0, nodelist=_prev_sub0.nodes(), node_size=0)
pos = nx.circular_layout(_sub1)
for p in pos.keys():
    pos[p][0] = pos[p][0]*0.75 + 0.125
    pos[p][1] = pos[p][1]*0.75 + 0.125
    
nx.draw(_sub1, pos=pos, nodelist=_sub1.nodes(), node_size=[community_sizes[n] for n in _sub1.nodes()], width=0.1)

tfs=15
subtitle = 'The collaboration network between organisations, produced by aggrgeating authors into university organisations.'.replace(' ','\ ')
subtitle_l2 = 'Same network as FigA2. Concentric layout, isolates in outer ring.'.replace(' ','\ ')
title = plt.title('FigA7: Organisation induced graph. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')
  • The community induced graph (FigA4) has 83 isolates, and 203 links.
  • The organisation induced graph (FigA5) has 3 isolates, and 542 links.

This tells us that the amount of inter-group collaboration is much higher when using the conventional organisational structure.

It seems that the community partition goes some way towards capturing the units of collaboration within the university.

However, despite the best attempts of the Louvain algorithm, a lot of collaboration between communities remains. In fact the connected component (inner circle of FigA4) has a connectance of 0.47, meaning that almost half of all potential inter-community links are realised.

It would be optimistic to try to divide authors into totally distinct units of collaboration.

However, the community strucutre determined from the collaboration network can tell us some interesting things...

  • Modularity is a measure of the extent of intra-group interaction over inter-group interaction. It is equal to 1 if all interactions occur within groups (i.e. all nodes in the induced graph are isolates). We evaluate the modularity for each of the partitions considered so far, to quantify the extent of inter-group collaboration in each case.
  • Mutual information (M.I.) is a similarity metric that captures the amount of shared information (Shanon entropy) between two variables. We calculate the mutual information between each partition and the 'organisational' partition, to determine their similarity. We use the adjusted M.I. score (calculated with scikit-learn) to account for the different numbers of groups in the partitions.

For comparison, we include a randomly assigned parition with the same number of groups as the best community partition (Louvain level 3: Fig4A).

In [283]:
partition_pure_authors_only = {author: org for author,org in zip(partition_pure.keys(),partition_pure.values()) if author in part0.keys()}
partition_summary = graphlab.SFrame()
partition_summary['Partition'] = ['University Organisations', 'Louvain Communities L1', 'Louvain Communities L2', 'Louvain Communities L3', 'Random Partition (||L3||)']

partition_list = [partition_pure_authors_only, part0, part1, part2, part_rand]
partition_summary['Modularity'] = ['%.4f' %community.modularity(graph=G, partition=part) for part in partition_list]

node_order = partition_pure_authors_only.keys()
mi = lambda part_a,part_b: sklm.mutual_info_score([part_a[node] for node in node_order], [part_b[node] for node in node_order])
ami = lambda part_a,part_b: sklm.adjusted_mutual_info_score([part_a[node] for node in node_order], [part_b[node] for node in node_order])
nmi = lambda part_a,part_b: sklm.normalized_mutual_info_score([part_a[node] for node in node_order], [part_b[node] for node in node_order])
#partition_summary['Normalised M.I.'] = ['%.4f' %nmi(partition_pure_authors_only, part) for part in partition_list]
#partition_summary['Mutual Information'] = ['%.4f' %mi(partition_pure_authors_only, part) for part in partition_list]
partition_summary['Adjusted M.I.'] = ['%.4f' %ami(partition_pure_authors_only, part) for part in partition_list]

partition_summary
Out[283]:
Partition Modularity Adjusted M.I.
University Organisations 0.5515 1.0000
Louvain Communities L1 0.6602 0.2820
Louvain Communities L2 0.7849 0.4155
Louvain Communities L3 0.7939 0.4334
Random Partition (||L3||) -0.0002 0.0025
[5 rows x 3 columns]

The modularity produced by the university's organistional structure is much higher than random, as we would expect. The community partition (L3) drammatically improves on this, containing much more of the research collaboration within 'units'.

From the mutual information (MI) we see that there is some similarity between the best community partition (L3) and the organisational partition - much more than we would expect at random. However, the MI score is sufficiently low (0.4334) that we can conclude these are two very different ways of grouping authors.

A. Part II - Studying community properties.

Studying the communities themselves can tell us about the interdisciplinary nature of the research they produce.

For example, we might look at the composition of the communities and ask which organisational groups do we see working together?

In the field of ecology the Shannon entropy is often used to calculate the diversity of an ecosystem [2], based on the relative abundance of each species present. Here we use this metric to determine the organisational diversity of each community, based on the relative abundance of each organisation it contains. For N organisations, the metric is maximal at ln(N) when all organisations in the community are present with equal abundance.

In [68]:
def _shanon(community, staff):
    '''Calculates the Shanon (organisational) diversity of a given community.'''
    
    _community_staff = staff.filter_by(column_name='PERSON_ID', values=community)
    organisational_partition = dict(zip(_community_staff['PERSON_ID'], _community_staff['orgid']))
    org_sizes = _calc_group_sizes(organisational_partition)
    
    
    total = sum(org_sizes.values())
    diversity = [(size/float(total))*np.log(size/float(total)) for size in org_sizes.values()]
    
    return - sum(diversity)
In [69]:
community_diversity_best_partition = graphlab.SFrame()
community_diversity_best_partition['comid'] = np.unique(part2.values())
div_col = []
for com in community_diversity_best_partition['comid']:
    com = [key for key in part2 if part2[key]==com]
    div_col.append(_shanon(com, staff))
    
community_diversity_best_partition['diversity'] = div_col
In [363]:
fsa =14
tfs=15
plt.figure(figsize=(12,5))

plt.subplot(1,2,1)
plt.hist(_calc_group_sizes(part2).values(), bins=20)
plt.xlabel('community size (num. authors)', fontsize=fsa)
plt.ylabel('frequency', fontsize=fsa)
plt.grid()
t1 = plt.title('FigA8(i): Distribution of community sizes.', fontsize=tfs)

plt.subplot(1,2,2)
plt.hist(community_diversity_best_partition['diversity'], bins=30)
plt.xlabel('diversity (Shanon)', fontsize=fsa)
plt.ylabel('frequency', fontsize=fsa)
plt.grid()
t2 = plt.title('FigA8(ii): Distribution of community diversities.', fontsize=tfs)
plt.tight_layout()

From the right hand plot we see that the distribution of diversities is bi-modal. It has a peak at zero, corresponding to communities that contain authors from only a single organisation. It has a second peak in diversity close to 0.7 - it turns out that this corresponds to communities of two authors from different organisations (ln(2)=0.693).

From this we learn that a significant amount of interdisciplinarity comes from pair-wise inter-organisational collaborations. Such pairings account for over a third in the peak of small commnuities seen in the left hand plot.

We now consider the most diverse community - that with the highest Shanon entropy.

The organisational compostion of the community looks like this:

In [576]:
ordered = community_diversity_best_partition.sort('diversity', ascending=False)
top_com = ordered[0]['comid']

ids_in_com = [node for node in part2.keys() if part2[node]==top_com]
com = staff.filter_by(column_name='PERSON_ID', values=ids_in_com).groupby('orgid', operations={'count':agg.COUNT('PERSON_ID'), 'code':agg.SELECT_ONE('ORGANISATION_CODE')})
plt.figure(figsize=(10,10))

#my_cols = [tuple(np.random.rand(3)) for i in range(len(com))]
my_cols = [tuple(np.random.rand(3)) for i in range(len(com))]
#my_cols = [(col[0],0,col[2]) for col in my_cols]
t=plt.pie(com['count'], labels=com['code'], labeldistance=1.15, colors=my_cols)

tfs=15
subtitle = 'Pie chart depicting organisational composition of the most Shannon diverse community.'.replace(' ','\ ')
subtitle_l2 = 'Each segment represents an organisation, labelled with its organisation code.'.replace(' ','\ ')
title = plt.title('FigA9: The most diverse community. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

Clearly there are a number of different organisations collaborating within this community.

The network of intra-community collaborations for the same community looks like this:

In [368]:
ordered = community_diversity_best_partition.sort('diversity', ascending=False)
top_com = ordered[0]['comid']

ids_in_com = [node for node in part2.keys() if part2[node]==top_com]
_sub = G.subgraph(nbunch=ids_in_com)
plt.figure(figsize=(9,9))
#nx.draw_circular(_sub, node_size=50)
pos = nx.spring_layout(_sub)
nx.draw(_sub, node_size=50, pos=pos)
hub = nx.draw_networkx_nodes(G=_sub, nodelist=[7520], node_color='green', pos=pos, node_size=100)

tfs=15
subtitle = 'Each node is an author, all are community members.'.replace(' ','\ ')
subtitle_l2 = 'Spring layout. Link weights not shown. Highest degree node in green.'.replace(' ','\ ')
title = plt.title('FigA10: Collaborations within the most diverse community. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

The above network plot uses a spring layout (as in FigA2).

It appears that there is some interesting collaboration structure within this community. For example, there are two highly connected sub-graphs. The highest degree author (plotted in green) appears to form a key bridge between these sub-graphs (although conclusions should not be made on network drawings alone!).

It would likely prove fruitful to further investigate the patterns within and between communities, both in terms of organisational and network structure.

However, due to time limitations, we finish section A with the assumption that the comunities detected are reasonably coherent units of collaboration. Once again we plot the community induced graph (for the best Louvain partition), this time with:

  • Communities coloured according to their diversity (blue=low; red=high Shanon organisational entropy).
  • Community plotted size scaled according to number of authors in the community.
  • Link width scaled according to weight (number of co-authored publications).
  • The same concentric layout.
In [376]:
node_scaling = 5
edge_scaling = 0.01
IG = community.induced_graph(graph=G, partition=part2)

plt.figure(figsize=(9,9))
cmap = 'RdBu'

test_nodes = IG.nodes()
isolates  = []
for node in test_nodes:
    node_in_edges = False
    for ed in IG.edges():
        if node in ed and ed[0]!=ed[1]:
            node_in_edges = True
            break
    
    if not node_in_edges:
        isolates.append(node)

connected = [node for node in IG.nodes() if node not in isolates]        
_sub0 = IG.subgraph(nbunch=isolates)
_sub1 = IG.subgraph(nbunch=connected)

nx.draw_circular(_sub0, nodelist=_sub0.nodes(), node_size=[node_scaling * community_sizes[n] for n in _sub0.nodes()])

pos = nx.circular_layout(_sub1)
for p in pos.keys():
    pos[p][0] = pos[p][0]*0.75 + 0.125
    pos[p][1] = pos[p][1]*0.75 + 0.125

diversity_dict = dict(zip(community_diversity_best_partition['comid'], community_diversity_best_partition['diversity']))
_temp = nx.draw_networkx_nodes(_sub1, pos=pos, nodelist=_sub1.nodes(), node_size=[node_scaling * community_sizes[n] for n in _sub1.nodes()], node_color=[diversity_dict[n] for n in _sub1.nodes()], cmap=cmap, width=0.5)
_temp = nx.draw_networkx_edges(_sub1, pos=pos, edgelist=[edge for edge in _sub1.edges()], width=[edge_scaling * _sub1.get_edge_data(edge[0], edge[1])['weight'] for edge in _sub1.edges()])

tfs=15
subtitle = 'For the best Louvain partition. Communities coloured by Shanon diversity (blue=low, red=high).'.replace(' ','\ ')
subtitle_l2 = 'Link width scaled by weight (number of publications). Node size scaled by community size.'.replace(' ','\ ')
title = plt.title('FigA11: Community induced graph. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

B. The publication network

The premise of the previous section was that we could find coherent units of collaboration by grouping auhtors into communities, based on the structure of the collaboration network. This approach yielded some initial insights that it would have been nice to spend more time exploring. But we must move on...

In this section we take the inverse of the collaboration network - now each node is a publication, and two nodes are connected if they share at least one author. We call this the publication network.

As with the collaboration network, we remove isolates (i.e. publications that share no author with any other).

(Currently this network is unweighted, but link weights could be given by the number of authors shared by two papers.)

With almost 30,000 nodes, and over 900,000 edges, the publication network is too large to meaningfully plot. But, similar to our approach in section A, we can use community detection to aggregate the publications into groups based on the authorship structure.

The premise here is that these communities of publications represent distinct fields of knowledge - and that again, these fields may differ somehow from those encoded in the organisational structure of the university.

Community detedction on the publication network produces 722 communities, and the induced graph looks like this:

In [386]:
## We generate the edge list in a separate script (pub_net_script2.py)
f = open('./publication_net_links_dict.pkl', 'rb')
_pub_links = pickle.load(f)
f.close()
In [382]:
PG = nx.Graph()

for pub in authors['PUBLICATION_ID'].unique():
    PG.add_node(pub)

for link in _pub_links:
    PG.add_edge(u=link[0], v=link[1])
    

degrees = PG.degree()
for node in PG.nodes():
    if degrees[node]==0:
        PG.remove_node(node)
In [419]:
part_pub = community.best_partition(PG)
IG_PG = community.induced_graph(graph=PG, partition=part_pub)
In [564]:
IG = IG_PG
community_sizes = _calc_group_sizes(part_pub)

node_scaling = 0.15
plt.figure(figsize=(9,9))
test_nodes = IG.nodes()
isolates  = []
for node in test_nodes:
    node_in_edges = False
    for ed in IG.edges():
        if node in ed and ed[0]!=ed[1]:
            node_in_edges = True
            break
    
    if not node_in_edges:
        isolates.append(node)

connected = [node for node in IG.nodes() if node not in isolates]        
_sub0 = IG.subgraph(nbunch=isolates)
_sub1 = IG.subgraph(nbunch=connected)

pos0 = nx.circular_layout(_sub0)
nx.draw(_sub0, pos=pos0, nodelist=_sub0.nodes(), node_size=[node_scaling * community_sizes[n] for n in _sub0.nodes()])
#nx.draw_circular(_sub0, nodelist=_sub0.nodes(), node_size=[community_sizes[n] for n in _sub0.nodes()])
pos = nx.circular_layout(_sub1)
for p in pos.keys():
    pos[p][0] = pos[p][0]*0.75 + 0.125
    pos[p][1] = pos[p][1]*0.75 + 0.125
    
nx.draw(_sub1, pos=pos, nodelist=_sub1.nodes(), node_size=[node_scaling * community_sizes[n] for n in _sub1.nodes()], width=0.3)

tfs=15
subtitle = 'The shared authorships between communities detected in the publication network.'.replace(' ','\ ')
subtitle_l2 = 'Louvain algorithm, best partition. Node size scaled by number of papers.'.replace(' ','\ ')
title = plt.title('FigB1: Community induced graph - publications. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

This network is qualitatively similar to the community induced graph from section A (FigA6).

We observe a large number of small isolated communities, and a core of larger interacting communities.

Here the isolates represent small collections of publications that share no author with any other community in the network.

The larger communities in the core should have many intra-community shared authorship links, but also retain links to other communities. These inter-community links may be of interest - representing authors that bridge fields of knowledge.

In [428]:
fsa = 15
tfs = 16
hist = plt.hist(community_sizes.values(), bins=50)
plt.yscale('log', nonposy='clip')
plt.xlabel('community size (number of publications)', fontsize=fsa)
plt.ylabel('log(frequency)', fontsize=fsa)
plt.grid()
tit = plt.title('FigB2: Distribution of community sizes.', fontsize=tfs)

Community sizes display a long-tailed distribution, with many small communities and few larger ones. Note that the y-axis scale in FigB2 is logarithmic.

Many of these smaller communities are isolates (651 isolates in total), and therefore represent distinct fields of knowledge based on authorship.

But to what extent can the larger communities be considered distinct fields of knowledge?

We now plot the network structure of the largest community:

In [467]:
top_com = 3

ids_in_com = [node for node in part_pub.keys() if part_pub[node]==top_com]
_sub = PG.subgraph(nbunch=ids_in_com)
plt.figure(figsize=(9,9))
pos = nx.spring_layout(_sub)
nx.draw(_sub, node_size=50, pos=pos)

tfs=15
subtitle = 'Each node is a publication, linked if they share authors. All are community members.'.replace(' ','\ ')
subtitle_l2 = 'Spring layout. Link weights not shown.'.replace(' ','\ ')
title = plt.title('FigB3: Publications in the largest community. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

And the community structure of the largest connected component from FigB1:

In [468]:
plt.figure(figsize=(9,9))

pos = nx.spring_layout(_sub1)
nx.draw(_sub1, node_size=50, pos=pos)

tfs=15
subtitle = 'Each node is a community of publications, linked if they share authors.'.replace(' ','\ ')
subtitle_l2 = 'Spring layout. Link weights not shown.'.replace(' ','\ ')
title = plt.title('FigB4: The connected component. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')

There seems to be some interesting structure in these two plots. For example, the connected component of communities shown in FigB4 has a familiar core-periphery structure. It may be the case that the peripheral communities represent more distinct fields of knowledge, whereas in the core it becomes harder to disambiguate.

It may well be informative to explore the structures for the larger communities (inner ring of FigB1), but such exploration is beyond the scope of this work.

It would also be interesting to look at the similarity of pubilications within these communities - do they come from the same or similar journals? Do they share keywords?

If these communities did prove themselves to be distinct fields or knowledge, could we then identify truly interdisciplinary authors as those which link these distinct fields?

C. The bipartite network

We now look to a novel representation of interdisciplinarity, that extends on the approach developed in sections A and B.

So far we have used community detection to try and detect: (1) distinct units of collaboration (groups of authors), and (2) distinct fields of knowledge (groups of papers).

We now combine these groupings by constructing a bipartite network - that is, a network with two different types of node. Each node can only interact with those of the other type.

In community ecology bipartite networks are used, among other things, to encode information about mutualistic systems.

For example, a pollination network encodes information about pollination, at a system level. Pollination is a mutualistic interaction since both parties benefit. Therefore the network is said to be mutualistic.

Here's a simple example from [3], showing some insects that pollinate flowering plants:

(There is an added level here - which we can ignore - that some of these interactions occur only in the day and some only at night.)

In [476]:
im = Image('./mut_net_example.jpeg')
sh = display(im)

The main assumptions here are that:

  • The animals and plants are grouped into coherent species.
  • Each pollinator species only interacts with certain plant species.
  • Something meaningful can be learned from studying the network structure formed by these interactions.

We now make the imaginative leap that academic research can be thought of in a similar way.

The analogy is that our fields of knowledge are like plants, being 'pollinated' by groups of authors. By studying which groups pollinate in which fields we can learn about the nature of research at the system level, and in particular about interdisciplinarity.

Clearly the structure of the bipartite network will depend strongly on our choice of grouping. That is, how we define fields of knowledge and groups of authors. But luckily we have already spent some time doing just that! (It would be possible to spend many more person-hours working on the grouping problem.)

So, to construct the network, the publications are grouped into the same commnuities depicted in FigB1. This gives our fields of knowledge ('plant species').

To group the authors (into 'pollinator species') we have two choices:

1) Grouping the authors into the university organisations (as in FigA7).

2) Grouping the authors into the communities detected (as in FigA11).

Using the organisations, the bipartite network looks like this:

In [450]:
author_partition = graphlab.SFrame()
author_partition['PERSON_ID'] = part2.keys()
author_partition['COM'] = [part2[pid] for pid in author_partition['PERSON_ID']]
In [451]:
pub_partition = graphlab.SFrame()
pub_partition['PUBLICATION_ID'] = part_pub.keys()
pub_partition['PCOM'] = [part_pub[pid] for pid in pub_partition['PUBLICATION_ID']]
In [452]:
community_interacions = authors.join(pub_partition, on='PUBLICATION_ID', how='inner')
community_interacions = community_interacions.join(author_partition, on='PERSON_ID', how='inner')
community_interacions = community_interacions.join(staff, on='PERSON_ID', how='inner')
In [453]:
r_edges_coms = community_interacions.groupby(['PCOM','COM'],operations={'FREQ':agg.COUNT('PUBLICATION_ID')})
r_edges_coms.save('r_edges_coms.csv')
r_edges_orgs = community_interacions.groupby(['PCOM','orgid'],operations={'FREQ':agg.COUNT('PUBLICATION_ID')})
r_edges_orgs.save('r_edges_orgs.csv')
In [486]:
## plotting is done in R using package 'bipartite'
tfs=15
plt.figure(figsize=(0.1,0.1))
subtitle = 'Author organisations in blue, publication communities in green.'.replace(' ','\ ')
subtitle_l2 = 'Thickness of nodes and links represent number of publications.'.replace(' ','\ ')
title = plt.title('FigC1: Bipartite network with author organisations. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')
t = plt.axis('off')

im=Image('./bipartite_organisations.png')
display(im)

And grouping the authors based on the best Louvain communities (as shown in FigA11), the bipartite network looks like this:

In [487]:
## plotting is done in R using package 'bipartite'
tfs=15
plt.figure(figsize=(0.1,0.1))
subtitle = 'Author communities in blue, publication communities in green.'.replace(' ','\ ')
subtitle_l2 = 'Thickness of nodes and links represent number of publications.'.replace(' ','\ ')
title = plt.title('FigC2: Bipartite network with author communities. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')
t = plt.axis('off')

im=Image('./bipartite_communities.png')
display(im)

Visually there are clear differences between the two representations.

The plotting algorithm attempts to order nodes to minimise the number of links that cross. The ordering appears to be more successful in FigC1 than FigC2. This is due to the different network structures - as discussed in section A, there is more apparent interdisciplinarity when using the university's organisational structure, than when using the detected communities. The same conclusion is apparent from looking at these two figures.

FigC2 shows a clear pattern: communities of authors (units of collaboration) working mainly on well defined groups of publications (distinct fields of knowledge). What remains, appearing almost as noise in the image, are the crossing links - those that connect units of collaboration with other fields of knowledge. These crossing links, perhaps, are the thing that would be most interesting to study further.

A number of metrics have been developed in ecology for the analysis of bipartite networks.

We briefly consider the following (as calculated using the method networklevel() in the bipartite package):

  • Connectance: as usual, the realised proportion of possible links.
  • Number of compartments: the number of subsets of the web that are not connected to another compartment.
  • Generality: a weighted 'effective number of prey' (measure of diet breadth).
  • Vulnerability: a weighted 'effective number of predators' (measure of predation pressure).
In [518]:
bipartite_comparison = graphlab.SFrame()
bipartite_comparison['Network'] = ['FigC1 (Organisations)', 'FigC2 (Communities)']
bipartite_comparison['Connectance'] = [0.049, 0.026]
bipartite_comparison['Num. Comp.'] = [2, 79]
bipartite_comparison['Gener.'] = [6.13, 3.35]
bipartite_comparison['Vulner.'] = [3.30, 1.90]

bipartite_comparison
Out[518]:
Network Connectance Num. Comp. Gener. Vulner.
FigC1 (Organisations) 0.049 2 6.13 3.3
FigC2 (Communities) 0.026 79 3.35 1.9
[2 rows x 5 columns]

Author organisations have a higher 'generality' compared to the author communities. Similarly publication communities have a higher 'vulnerability' to organisations, than to author communities. In other words, the author communities are more specialised (or focused) in their authorship.

The bipartite network with author communities (FigC2) has a lower connectance, and a much higher number of disconnected compartments.

The values of these metrics support our previous conclusions about the use of Organisations versus Communities - there is more apparent interdisciplinarity when unsing the organisations.

In [511]:
## For final plot:
r_edges_final = community_interacions.groupby(['PCOM','COM'],operations={'FREQ':agg.COUNT('PUBLICATION_ID')})

coms_appear_once = r_edges_final.groupby('COM', operations={'cnt':agg.COUNT('PCOM')})
pcoms_appear_once= r_edges_final.groupby('PCOM', operations={'cnt':agg.COUNT('COM')})

coms_appear_once = coms_appear_once[coms_appear_once['cnt']==1]['COM']
pcoms_appear_once = pcoms_appear_once[pcoms_appear_once['cnt']==1]['PCOM']

r_edges_final = r_edges_final.add_row_number()
remove_rows  = []
for row in r_edges_final:
    if row['COM'] in coms_appear_once and row['PCOM'] in pcoms_appear_once:
        remove_rows.append(row['id'])

r_edges_final = r_edges_final.filter_by(column_name='id', values=remove_rows, exclude=True)
r_edges_final = r_edges_final.remove_column('id')
r_edges_final.save('r_edges_connected_cpt.csv')

It would be fascinating to push the analysis further, and with more rigour. And to apply more of the tools from network ecology to make sense of this network. To study the effects of different groupings, look for temporal structures in the data, and attempt to detect missing links, or suggest new ones.

But, unfortunately, I have my own role to play in this network. Those fields of knowledge to won't pollinate themselves!

We fininsh with a bipartite plot of the largest connected component from FigC2, to get an intuitive feel for the interdisciplinarity that remains (despite our best efforts of at grouping):

In [516]:
## plotting is done in R using package 'bipartite'
tfs=15
plt.figure(figsize=(0.1,0.1))
subtitle = 'Of bipartite network with author communities (FigC2).'.replace(' ','\ ')
subtitle_l2 = 'Author communities in blue, publication communities in green.'.replace(' ','\ ')
title = plt.title('FigC3: Largest connected component. \n\n\t$%s$\n\t$%s$' %(subtitle,subtitle_l2), fontsize=tfs, loc='left')
t = plt.axis('off')

im=Image('./bipartite_connected.png')
display(im)

D. References:

[1] Blondel, Vincent D., et al. "Fast unfolding of communities in large networks." Journal of statistical mechanics: theory and experiment 2008.10 (2008): P10008.

[2] Hill, Mark O. "Diversity and evenness: a unifying notation and its consequences." Ecology 54.2 (1973): 427-432.

[3] Macgregor, Callum J., et al. "Pollination by nocturnal Lepidoptera, and the effects of light pollution: a review." Ecological entomology 40.3 (2015): 187-198.